[57] ABSTRACT

A neural network is trained to output a time dependent 
target vector defined over a predetermined time inter- 
val in response to a time dependent input vector defined 
over the same time interval by applying corresponding 
elements of the error vector, or difference between the 
target vector and the actual neuron output vector, to 
the inputs of corresponding output neurons of the network
as corrective feedback. This feedback decreases the
error and quickens the learning process, so that a much 
smaller number of training cycles are required to com- 
plete the learning process. A conventional gradient 
descent algorithm is employed to update the neural 
network parameters at the end of the predetermined 
time interval. The foregoing process is repeated in
repetitive cycles until the actual output vector corresponds
to the target vector. In the preferred embodiment,
as the overall error of the neural network output
decreases during successive training cycles, the portion 
of the error fed back to the output neurons is decreased 
accordingly, allowing the network to learn with greater 
freedom from teacher forcing as the network parame- 
ters converge to their optimum values. The invention 
may also be used to train a neural network with station- 
ary training and target vectors. 


39 Claims, 10 Drawing Sheets 
FAST TEMPORAL NEURAL LEARNING USING
TEACHER FORCING 

BACKGROUND OF THE INVENTION
Origin of the Invention 

The invention described herein was made in the performance
of work under a NASA contract, and is subject
to the provisions of Public Law 96-517 (35 USC
202) in which the contractor has elected not to retain 
title. 

Technical Field 

The invention relates to training neural networks
with time dependent phenomena and to the problems 
associated therewith, including reducing the number of 
computations required and increasing the quality or 
fidelity of the neural network output. 

Background Art 

Recently, there has been a tremendous interest in
developing learning algorithms capable of modeling
time-dependent phenomena. In particular, considerable
attention has been devoted to capturing the dynamics
embedded in observed temporal sequences.

In general, the neural architectures under consider- 
ation may be classified into two categories: 

* Feedforward networks, in which back propagation
through time can be implemented. This architecture
has been extensively analyzed, and is widely
used in simple applications due, in particular, to the
straightforward nature of its formalism.

* Recurrent networks, also referred to as feedback or
fully connected networks, which are currently
receiving increased attention. A key advantage of
recurrent networks lies in their ability to use information
about past events for current computations.
Thus, they can provide time-dependent outputs for
both time-dependent as well as time-independent
inputs.

One may argue that, for many real world applica- 
tions, the feedforward networks suffice. Furthermore, a 
recurrent network can, in principle, be unfolded into a 
multilayer feedforward network. A detailed analysis of
the merits and demerits of these two architectures is 
beyond the scope of this specification. Here, we will 
focus only on recurrent networks. 

The problem of temporal learning can typically be
formulated as a minimization, over an arbitrary but
finite time interval, of an appropriate error functional.
The gradients of the functional with respect to the various
parameters of the neural architecture, e.g., synaptic
weights, neural gains, etc., are essential elements of the
minimization process and, in the past, major efforts
have been devoted to the efficacy of their computation.
Calculating the gradients of a system’s output with
respect to different parameters of the system is, in general,
of relevance to several disciplines. Hence, a variety
of methods have been proposed in the literature for
computing such gradients. A recent survey of techniques
which have been considered specifically for
temporal learning can be found in Pearlmutter, B. A.
(1990) “Dynamic recurrent neural networks,” Technical
Report CMU-CS-90-196, School of Computer Science,
Carnegie Mellon University, Pittsburgh, Pa. We
will briefly mention only those which are relevant to
the present invention.
Sato proposed, at the conceptual level, an algorithm
based upon Lagrange multipliers. However, his algorithm
has not yet been validated by numerical simulations,
nor has its computational complexity been analyzed.
Williams and Zipser [Williams, R. J., and Zipser,
D. (1989) “A learning algorithm for continually running
fully recurrent neural networks”, Neural Computation,
Vol. 1, No. 2, pp. 270-280] presented a scheme in which
the gradients of an error functional with respect to
network parameters are calculated by direct differentiation
of the neural activation dynamics. This approach is
computationally very expensive and scales poorly to
large systems. The inherent advantage of the scheme is
the small storage capacity required, which scales as
$O(N^3)$, where N denotes the size of the network.

Pearlmutter, on the other hand, described a varia- 
tional method which yields a set of linear ordinary dif- 
ferential equations for backpropagating the error 
through the system. These equations, however, need to 
be solved backwards in time, and require temporal stor- 
age of variables from the network activation dynamics, 
thereby reducing the attractiveness of the algorithm. 
Recently, the inventors herein [Toomarian, N. and Bar- 
hen, J. (1991) “Adjoint operators and non-adiabatic 
algorithms in neural networks,” Applied Mathematical 
Letters, Vol. 4, No. 2, pp. 69-73] suggested a frame- 
work formalism which enables the error propagation 
system of equations to be solved forward in time, con- 
comitantly with the neural activation dynamics. A 
drawback of this novel approach came from the fact 
that their equations had to be analyzed in terms of distri- 
butions, which precluded straightforward numerical 
implementation. Finally, Pineda proposed combining 
the existence of disparate time scales with a heuristic 
gradient computation. The underlying adiabatic as- 
sumptions and highly approximate gradient evaluation 
technique, however, placed severe limits on the applica- 
bility of his method. 

Analogy to real-life behavior motivates the learning 
paradigm of the present invention described below. 
Suppose that a parent wants to teach his child to ride a 
bicycle. Clearly, the parent will not stay home, let his 
child ride the bicycle and, from time to time, tell him 
how well or badly he is performing (just as happens in
classical supervised learning). The best way to train the 
child would be for the parent to accompany him during 
the riding sessions. This suggests that different dynami- 
cal systems should be considered for the two basic 
stages of learning and recall (or generalization). How- 
ever, the functional form of the neural dynamics used 
during the learning stage should smoothly evolve 
toward the functional form of the neural dynamics to be 
used during recall, after training is completed. In this 
context, the network dynamics during the learning 
stage should include an instantaneous signal from the 
teacher on its performance. This necessitates a mecha- 
nism for incorporating information regarding the de- 
sired output directly into the activation dynamics. Such 
a mechanism has been referred to as teacher forcing. 
Williams and Zipser [Williams, R. J., and Zipser, D.
(1988) “A learning algorithm for continually running
fully recurrent neural networks,” Technical Report ICS
Report 8805, UCSD, La Jolla, Calif. 92093], to the best
of our knowledge, have been the primary users of
teacher forcing. They limited their algorithm to a
discrete-time problem, replacing the output of the network
with desired output values at each time step.



5,428,710

SUMMARY OF THE INVENTION 

The present invention is a new continuous form of
teacher forcing, and appropriately modifies the activation
dynamics of a simple additive neural network during
its learning stage. The temporal modulation of
teacher forcing is analyzed as learning proceeds, so that
the activation dynamics of the learning stage can actually
be reduced to the activation dynamics of the recall
stage.

In accordance with the invention, a neural network is
trained to output a time dependent target vector defined
over a predetermined time interval in response to a time
dependent input vector defined over the same time
interval by applying corresponding elements of the
error vector, or difference between the target vector
and the actual neuron output vector, to the inputs of
corresponding output neurons of the network as corrective
feedback. This feedback decreases the error and
quickens the learning process, so that a much smaller
number of training cycles is required to complete the
learning process. The learning process employs a conventional
gradient descent algorithm to update the neural
network parameters (e.g., synapse weights and/or
neuron gains) at the end of the time interval. The foregoing
process is repeated in repetitive cycles until the
actual output vector corresponds to the target vector. It
has been found that not only is the number of required
training cycles decreased but that the quality or fidelity
of the neural network output is significantly increased
by the invention. In the preferred embodiment, as the
overall error of the neural network output decreases
during successive training cycles, the portion of the
error fed back to the output neurons is decreased accordingly,
allowing the network to learn with greater
freedom from teacher forcing as the network parameters
converge to their optimum values.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1a is a diagram of a neural network of the prior
art.

FIG. 1b is a diagram of a neural network training architecture
of the prior art.

FIG. 2 is a diagram of a neural network numerical
training architecture of the prior art including error
feedback which nulls the error at each numerical step.

FIG. 3 is a time domain diagram illustrating the behavior
of the neural network training architecture of
FIG. 2.

FIG. 4 is a simplified diagram of a neural network
training architecture embodying the present invention.

FIG. 5 is a time domain diagram illustrating the behavior
of the neural network training architecture of
FIG. 4.

FIG. 6 is a system diagram corresponding to the
neural network training architecture of FIG. 4.

FIG. 7 is a flow diagram illustrating the operation of
the neural network training architecture of FIG. 4 using
a generic gradient descent algorithm for computing the
neural network parameter changes during training.

FIG. 8 is a system diagram illustrating a preferred
embodiment of the system of FIG. 5.

FIGS. 9a and 9b together constitute a flow diagram
illustrating the operation of the neural network training
architecture of FIG. 4 for an embodiment employing a
particular type of conventional gradient descent algorithm.
FIGS. 10, 11 and 12 illustrate different simulation results
of a neural network learning a circular motion using the
invention.

FIG. 13 is a graph of the error as a function of the
number of learning iterations for each of the cases illustrated
in FIGS. 10-12.

FIGS. 14, 15 and 16 illustrate different simulation results
of a neural network learning a figure-eight motion
using the invention.

FIG. 17 is a graph of the error as a function of the
number of learning iterations for each of the cases illustrated
in FIGS. 14-16.

DETAILED DESCRIPTION OF THE 
INVENTION 

Temporal Learning Framework: 

We formalize a neural network as an adaptive dynam- 
ical system whose temporal evolution is governed by 
the following set of coupled nonlinear differential equa- 
tions: 


$$\dot{u}_n + \kappa_n u_n = g_n\Big[\gamma_n\Big(\sum_m T_{nm}\,u_m + I_n\Big)\Big], \qquad t > 0 \qquad (1)$$


where $u_n$ represents the output of the n-th neuron ($u_n(0)$
being the initial state), and $T_{nm}$ denotes the strength of
the synaptic coupling from the m-th to the n-th neuron.
The constants $\kappa_n$ characterize the decay of neuron activities.
The sigmoidal functions $g_n(\cdot)$ modulate the neural
responses, with gain given by $\gamma_n$; typically,
$g_n(\gamma_n x)=\tanh(\gamma_n x)$. In order to implement a nonlinear
functional mapping from an $N_I$-dimensional input space
to an $N_O$-dimensional output space, the neural network
is topographically partitioned into three mutually exclusive
regions. As shown in FIG. 1a, the partition refers
to a set of input neurons $S_I$, a set of output neurons $S_O$,
and a set of “hidden” neurons $S_H$. Note that this architecture
is not formulated in terms of “layers” and that
each neuron may be connected to all others including
itself.
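By way of illustration only, the activation dynamics can be integrated numerically. The minimal sketch below assumes that Eq. (1) has the additive form du_n/dt + κ_n u_n = g_n[γ_n(Σ_m T_nm u_m + I_n)] with g_n = tanh. The function name `euler_step`, the step size `dt` and the use of a forward Euler integrator are assumptions of this sketch and are not prescribed by the specification.

```python
import math

def euler_step(u, T, kappa, gamma, I, dt):
    """Advance all neuron outputs u by one forward-Euler step of size dt.

    Assumed dynamics: du_n/dt = -kappa_n*u_n + tanh(gamma_n*(sum_m T[n][m]*u[m] + I[n])).
    Every neuron may couple to every other neuron, including itself.
    """
    n_neurons = len(u)
    new_u = []
    for n in range(n_neurons):
        # Net input: weighted sum of all neuron outputs plus the external input.
        net = sum(T[n][m] * u[m] for m in range(n_neurons)) + I[n]
        du = -kappa[n] * u[n] + math.tanh(gamma[n] * net)
        new_u.append(u[n] + dt * du)
    return new_u
```

Calling `euler_step` repeatedly over the interval yields a sampled trajectory of the network; a smaller `dt` or a higher-order integrator may be preferable in practice.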

Let $\bar{a}(t)$ (the overhead bar denotes a vector) be an N-dimensional
vector of target temporal patterns, with
nonzero elements, $a_n(t)$, in the input and output sets
only. When trajectories, rather than mappings, are considered,
components in the input set may also vanish.
Hence, the time-dependent external input term in Eq.
(1), i.e., $I_n(t)$, encodes the component contribution of the
target temporal pattern via the expression


$$I_n(t) = \begin{cases} a_n(t) & \text{if } n \in S_I \\ 0 & \text{if } n \in S_H \cup S_O \end{cases} \qquad (2)$$


To proceed formally with the development of a temporal
learning algorithm, we consider an approach
based upon the minimization of an error functional, E,
defined over the time interval $[t_0,t_f]$ by the following
expression


$$E(\bar{u},\bar{p}) = \frac{1}{2}\int_{t_0}^{t_f} \sum_n e_n^2 \, dt = \int_{t_0}^{t_f} F \, dt \qquad (3)$$
where the error component, $e_n(t)$, represents the difference
between the desired and actual value of the output
neurons, i.e.,

$$e_n(t) = \begin{cases} a_n(t) - u_n(t) & \text{if } n \in S_O \\ 0 & \text{if } n \in S_I \cup S_H \end{cases} \qquad (4)$$
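The error functional can be approximated from sampled trajectories. The sketch below assumes the form E = (1/2) ∫ Σ_n e_n² dt with e_n = a_n − u_n on the output neurons, and replaces the integral by a simple rectangle rule; the function name `error_functional` and the uniform sampling step `dt` are illustrative assumptions.

```python
def error_functional(targets, outputs, dt):
    """Approximate E = 1/2 * integral of sum_n e_n(t)^2 dt by a rectangle rule.

    targets[k][n] and outputs[k][n] hold a_n and u_n of output neuron n at
    sample k; dt is the (assumed uniform) spacing between samples.
    """
    total = 0.0
    for a_t, u_t in zip(targets, outputs):
        # F(t) = 1/2 * sum over output neurons of the squared error.
        f_t = 0.5 * sum((a - u) ** 2 for a, u in zip(a_t, u_t))
        total += f_t * dt
    return total
```

Only output neurons need to be passed in, since the error component is defined as zero on the input and hidden sets.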


In our model, the internal dynamical parameters of
interest are the strengths of the synaptic interconnections,
$T_{nm}$, the characteristic decay constants, $\kappa_n$, and
the gain parameters, $\gamma_n$. They can be represented as a
vector of M [where: $M=N^2+2N$] components


$$\bar{p} = \{T_{11}, \ldots, T_{NN}, \kappa_1, \ldots, \kappa_N, \gamma_1, \ldots, \gamma_N\} \qquad (5)$$


We will assume that the elements of $\bar{p}$ are statistically
independent. Furthermore, we will also assume that, for
a specific choice of parameters and set of initial conditions,
a unique solution of Eq. (1) exists. Hence, the state
variables $\bar{u}$ are an implicit function of the parameters $\bar{p}$.

In the rest of this paper, we will denote the $\mu$-th element
of the vector $\bar{p}$ by $p_\mu$ ($\mu=1,\ldots,M$).

Traditionally, learning algorithms are constructed by
invoking Lyapunov stability arguments, i.e., by requiring
that the error functional be monotonically decreasing
during learning time, $\tau$. This translates into

$$\frac{dE}{d\tau} = \sum_{\mu=1}^{M} \frac{dE}{dp_\mu}\,\frac{dp_\mu}{d\tau} < 0 \qquad (6)$$

One can always choose, with $\eta>0$,

$$\frac{dp_\mu}{d\tau} = -\eta\,\frac{dE}{dp_\mu} \qquad (7)$$


which implements learning in terms of an inherently
local minimization procedure. Attention should be paid
to the fact that Eqs. (1) and (7) may operate on different
time scales, with parameter adaptation occurring at a
slower pace. Integrating the dynamical system, Eq. (7),
over the interval $[\tau, \tau+\Delta\tau]$, one obtains

$$p_\mu(\tau+\Delta\tau) = p_\mu(\tau) - \eta \int_{\tau}^{\tau+\Delta\tau} \frac{dE}{dp_\mu}\, d\tau \qquad (8)$$
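If the gradient is taken as constant over one learning interval (an assumption of this sketch, not a statement of the specification), the integrated update reduces to p_μ ← p_μ − η Δτ dE/dp_μ. A minimal illustration follows; the names `update_parameters`, `eta` and `dtau` are hypothetical.

```python
def update_parameters(p, grad, eta, dtau=1.0):
    """Return the parameter vector after one learning interval.

    p    : current parameters p_mu (synaptic weights, decay constants, gains)
    grad : dE/dp_mu, assumed constant over the interval of length dtau
    eta  : positive learning rate
    """
    return [p_mu - eta * dtau * g_mu for p_mu, g_mu in zip(p, grad)]
```

With `dtau=1.0` this is the familiar gradient descent step; a positive `eta` moves each parameter against its gradient, as required for the error functional to decrease.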
Equation (8) implies that, in order to update a system
parameter $p_\mu$, one must evaluate the “sensitivity” (i.e.,
the gradient) of E, Eq. (3), with respect to $p_\mu$ in the
interval $[\tau, \tau+\Delta\tau]$. Furthermore, using Eq. (3) and
observing that the time integral and derivative with
respect to $p_\mu$ commute, one can write

$$\frac{dE}{dp_\mu} = \int_{t_0}^{t_f} \frac{dF}{dp_\mu}\,dt = \int_{t_0}^{t_f} \frac{\partial F}{\partial p_\mu}\,dt + \int_{t_0}^{t_f} \frac{\partial F}{\partial \bar{u}} \cdot \frac{\partial \bar{u}}{\partial p_\mu}\,dt \qquad (9)$$

This sensitivity expression has two parts. The first term
in the Right Hand Side (RHS) of Eq. (9) is called the
“direct effect”, and corresponds to the explicit dependence
of the error functional on the system parameters.
The second term in the RHS of Eq. (9) is referred to as
the “indirect effect”, and corresponds to the implicit
relationship between the error functional and the system
parameters via $\bar{u}$. In our learning formalism, the
error functional, as defined by Eq. (3), does not depend
explicitly on the system parameters; therefore, the “direct
effect” vanishes, i.e.,

$$\frac{\partial F}{\partial p_\mu} = 0 \qquad (10a)$$

Since F is known analytically (viz. Eqs. (3) and (4)),
computation of $\partial F/\partial \bar{u}$ is straightforward. Indeed

$$\frac{\partial F}{\partial u_n} = -e_n \qquad (10b)$$

Thus, to enable evaluation of the error gradient using
Eq. (9), the “indirect effect” matrix $\partial\bar{u}/\partial\bar{p}$ should, in
principle, be computed.
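Once the “indirect effect” matrix is available at each sample time, the gradient follows by quadrature. The sketch below assumes that the direct effect vanishes and that ∂F/∂u_n = −e_n, so that dE/dp_μ = −∫ Σ_n e_n (∂u_n/∂p_μ) dt; the function name `error_gradient`, the data layout and the rectangle rule are assumptions made for illustration.

```python
def error_gradient(errors, sensitivities, dt):
    """Approximate dE/dp_mu = -integral of sum_n e_n(t) * du_n/dp_mu dt.

    errors[k][n]            : e_n at sample k
    sensitivities[k][n][mu] : du_n/dp_mu at sample k
    dt                      : uniform spacing between samples (assumed)
    """
    n_params = len(sensitivities[0][0])
    grad = [0.0] * n_params
    for e_t, s_t in zip(errors, sensitivities):
        for e_n, s_n in zip(e_t, s_t):
            for mu in range(n_params):
                # Indirect effect only: dF/du_n = -e_n, direct effect is zero.
                grad[mu] -= e_n * s_n[mu] * dt
    return grad
```

The result can be passed directly to a gradient descent update of the parameter vector.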

TEACHER FORCING 

The neural activation dynamics specified by Eqs. (1)
and (2) do not include explicit information regarding
the desired network output. If these equations are used
in conjunction with the learning formalism described in
the previous section, the network parameters (i.e., the
elements of $\bar{p}$) will be modified at the end of a trajectory,
i.e., at time $t_f$, as shown schematically in FIG. 1b.
Such a parameter adaptation is based upon the total
error between the desired and the actual output of the
network, accumulated over the interval $[t_0,t_f]$. Referring
to FIG. 1b, a neural network 2 is stimulated by a
time-varying training vector I(t) to produce a time-varying
output vector u(t). A subtractor 4 subtracts the
output vector u(t) from a time-varying target vector a(t)
to produce a time-varying error vector e(t). An integrator
6 integrates the error vector e(t) over the time period
of the time-varying training vector I(t). At the end
of the time period, the result of this integration is used
by a gradient descent algorithm to change the parameters
(e.g., the synapse weights) of the neural network in
such a manner as to reduce the output of the integrator
6 in the next time period. In our earlier analogy to real-life
behavior, this would correspond to a parent staying
home, letting his child ride a bicycle and, after each
trial, telling him all the errors he made. “Conventional”
supervised learning operates in this fashion, and usually
takes a great many iterations to produce the desired
results.

In order to overcome this difficulty, we consider the
concept of teacher forcing, i.e., driving the output neurons
to desired values in finite time. Williams and Zipser
[Williams, R. J., and Zipser, D. (1988) “A learning
algorithm for continually running fully recurrent neural
networks,” Technical Report ICS Report 8805, UCSD,
La Jolla, Calif. 92093] disclose forcing in a similar context.
Their focus, however, is on discrete time problems.
To highlight the differences between the two approaches
we make the following observations. By definition,
the conventional output of a network at time step
(t+1), without teacher forcing, is a function of the
external inputs to the network and of the network’s
states at time step (t), i.e., in our notation,

$$u_n(t+1) = g_n[I_i(t),\,u_j(t),\,\bar{p}]$$

where $n \in S_O$, $i \in S_I$ and $j \in S_I \cup S_H \cup S_O$. To introduce
teacher forcing, Williams and Zipser replace the output
of the network with the desired output values at time
step (t). This means that

$$u_n(t+1) = g_n[I_i(t),\,u_j(t),\,a_n(t),\,\bar{p}]$$
where $n \in S_O$, $i \in S_I$ and $j \in S_I \cup S_H$. The network parameters
can be updated either at the end of each time step,
or at the end of the trajectory, i.e., at time $t_f$. A schematic
block diagram of this model, in which the parameters
are updated at the end of the trajectory, is given in
FIG. 2. Referring to FIG. 2, at time t the neural network
is “forced” to an output vector equal to the current
target vector a(t). The neural network then responds
to the current training vector I(t) to produce an
output vector u(t+1) at the next time step t+1. The
subtractor 4 subtracts the output vector u(t+1) from
the target vector a(t+1) of the next time step t+1 to
produce an error vector e(t+1). Thereafter, the operation
of the model of FIG. 2 is analogous to that of FIG.
1b. The temporal behavior of this model is illustrated in
FIG. 3, in which the neuron outputs are forced to the
training target (zero-error) values at the end of each
time step. Since the network outputs, $u_n(t+1)$, $n \in S_O$, are
dependent upon the desired values $a_n(t)$ of the network
outputs at time step t, the algorithm can be interpreted
as training the network to capture the velocity of given
points on the trajectory, rather than the trajectory itself.
In our earlier analogy, each time interval may be
viewed as a learning session at the end of which the
parent is correcting the child’s performance.
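The prior-art discrete scheme described above can be sketched as follows. This is only an illustrative reading: the tanh squashing function, the weight layout `W` and the function name `forced_step` are assumptions of the sketch and are not taken from Williams and Zipser's report.

```python
import math

def forced_step(u, targets, output_idx, W, x):
    """One discrete time step with teacher forcing of the output units.

    u          : unit states at time t
    targets    : dict mapping each output unit index to its target a_n(t)
    output_idx : indices of the output units
    W          : W[n] holds the weights of unit n over [states..., inputs...]
    x          : external inputs at time t
    """
    forced = list(u)
    for n in output_idx:
        # Teacher forcing: the actual output is replaced by the target value.
        forced[n] = targets[n]
    z = forced + list(x)
    return [math.tanh(sum(w * v for w, v in zip(row, z))) for row in W]
```

Because the state at t+1 is computed from the target at t rather than from the network's own output, the error is reset at every step, which corresponds to the step-wise behavior illustrated in FIG. 3.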

The teacher forcing paradigm of the present invention,
on the other hand, stems from feedback control. In
such a scheme, with continuous network dynamics, the
error between the actual and the desired outputs is fed
back as inputs to the network output set neurons. A
schematic block diagram of the invention is presented in
FIG. 4. Referring to FIG. 4, on a simplistic level the
operation of the invention is analogous to the model of
FIG. 1b discussed above. However, the invention modifies
the error vector e(t) by a function $\lambda(t)$ and feeds the
modified error vector back to the neural network 2 in
real time. Preferably, this feedback is applied directly to
the inputs of the array of output neurons of the neural
network 2. As can be seen, the parameters of the network
are updated based upon the error accumulated
over the length of the trajectory, i.e., over the interval
$[t_0,t_f]$. Again, by analogy, this scheme corresponds to a
parent accompanying his child and holding the bicycle
during the trajectory, to keep him on the right track as
much as possible. At the end of the trajectory the parent
would explain to his child what went wrong and where,
so that corrective action can be taken for the next
round. In order to incorporate this teacher forcing into
the neural learning formalism presented earlier, the
time-dependent input to the neural activation dynamics,
Eq. (1), i.e., $I_n(t)$ as given by Eq. (2), is modified to read:

$$I_n(t) = \begin{cases} a_n(t) & \text{if } n \in S_I \\ 0 & \text{if } n \in S_H \\ \lambda\,[a_n(t)-u_n(t)]^{\beta}\,[a_n(t)]^{1-\beta} & \text{if } n \in S_O \end{cases} \qquad (11)$$

At this stage, $\lambda$ and $\beta$ are assumed to be positive constants.
The purpose of the term $[a_n(t)]^{1-\beta}$ is to insure
that $I_n(t)$ has the same dimension as $a_n(t)$ and $u_n(t)$. It
has been demonstrated (1989) that in general, for
$\beta=(2i+1)/(2j+1)$, $i<j$ and i and j strictly positive
integers, an expression of the form $[a_n-u_n]^{\beta}$ induces a
terminal attractor phenomenon for the dynamics described
by Eq. (1). Barhen et al. [Barhen, J., Toomarian,
N. and Gulati, S. (1990) “Adjoint operator algorithms
for faster learning in dynamical neural networks,” in
David S. Touretzky (Ed.), Advances in Neural Information
Processing Systems, Vol. 2, pp. 498-508, San Mateo,
Calif. (Morgan Kaufmann); and, Barhen, J., Toomarian,
N. and Gulati, S. (1990) “Application of adjoint operators
to neural learning,” Applied Mathematical Letters,
Vol. 3, No. 3, pp. 13-18] have considered terminal attractor
dynamics induced from the input set, rather than
the output set, $S_O$. They have observed that such
dynamics enable time-independent mappings to be learned
much faster than backpropagation. This provided the
motivation for choosing $\beta=7/9$ for the numerical simulations
described below in this specification. Simulations
with other positive constants, such as $\beta=1$, have
produced, qualitatively, similar results, albeit over a
longer training period. A study of the sensitivity of the
results to the choice of $\beta$ is beyond the scope of this
specification.
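The forcing term applied to an output neuron may be sketched as below, reading it as λ [a_n − u_n]^β [a_n]^(1−β). Because β is a ratio of odd integers, the power of a negative error is taken here as the real odd root, and [a_n]^(1−β) is evaluated on the magnitude of a_n; both conventions, together with the name `forcing_input`, are assumptions of this sketch.

```python
import math

def forcing_input(a, u, lam, beta=7.0 / 9.0):
    """Teacher forcing input for one output neuron.

    a    : target value a_n(t)
    u    : actual neuron output u_n(t)
    lam  : forcing gain lambda
    beta : exponent (2i+1)/(2j+1); 7/9 is the value cited for the simulations
    """
    err = a - u
    # Real-valued odd root: keep the sign of the error, raise its magnitude.
    signed_root = math.copysign(abs(err) ** beta, err)
    # Scaling factor so the input has the same dimension as a and u.
    scale = abs(a) ** (1.0 - beta)
    return lam * signed_root * scale
```

The term vanishes when the output equals the target, and for `beta=1.0` it reduces to plain proportional feedback `lam * (a - u)`.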

When learning is successfully completed [i.e.,
$e_n(t) \to 0$], teacher forcing will vanish, and the network
will revert to the conventional dynamics given by Eqs.
(1) and (2). However, there might be instances where
the error functional cannot be reduced to zero, implying
that the teacher forcing term will not vanish as
learning proceeds. Thus, a discrepancy in results between
the learning and recall mode of the network
should be expected. In an attempt to overcome this
problem, we recall another lesson from life. When a
parent teaches his child to ride a bicycle, at early stages
he keeps his hands on the bicycle, accompanying the
child. However, as soon as the child shows some
learned skills in controlling himself, the parent will take
his hands off more and more often, to let the child ride
independently. In this vein, the teacher’s intervention in
the learning process preferably decreases as learning
progresses. Specifically, in Equation (11) $\lambda$ may be modulated
in time as a function of the error functional, according
to

$$\lambda(\tau) = 1 - e^{-E(\tau)} \qquad (12)$$

The above expression should be understood as indicating
that, while $\lambda$ varies on the learning time scale, it
remains at essentially constant levels during the iterative
passes over the interval $[t_0,t_f]$.
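The modulation of the forcing gain can be illustrated in a few lines. The sketch reads the modulation rule as λ(τ) = 1 − exp[−E(τ)], which is the reading consistent with the forcing fading as the error falls; the function name `teacher_gain` is hypothetical.

```python
import math

def teacher_gain(total_error):
    """Forcing gain for the next learning cycle from the current total error E.

    Large E gives a gain near 1 (strong forcing); E -> 0 gives a gain near 0,
    so the learning dynamics approach the recall dynamics.
    """
    return 1.0 - math.exp(-total_error)
```

The gain is computed once per learning cycle and held fixed during the pass over the time interval, in keeping with the two time scales described above.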

The behavior in a time continuum of the neural network
in the training architecture of FIG. 4 is illustrated
in FIG. 5, in accordance with the temporal evolutionary
behavior defined by equation (1). At a given time t,
the output of a given output neuron is u(t) while the
target value for that neuron is a(t), which differs from
the actual neuron output by an error e(t). The feedback
of the error e(t), illustrated in FIG. 4, reduces the error
at the next time differential, t+dt, by an amount f[e(t)]
which is a function of the error e(t). Thus, without the
invention, the neuron output at the next time differential
t+dt would have been u'(t+dt), but with the invention
the error at t+dt is reduced by f[e(t)] to produce a
neuron output u(t+dt) which is closer to the target
output a(t+dt). The overall result is that the total error
$E(\tau)$ of equation (3) is reduced. In accordance with
equation (12), the amount of correction, namely the
proportion of the error e(t) fed back to the output neuron,
is reduced as the total error $E(\tau)$ of equation (3) is
reduced at the end of each learning cycle of time duration
$\Delta\tau=[t_0,t_f]$.
A significant advantage of the invention is that it
works in the time continuum of the differential equation
of Equation (3), while the technique of FIGS. 2 and 3 is
a numerical simulation not realizable using analog neurons.

FIG. 6 illustrates a tutorial example corresponding
to the architecture of FIG. 4, in which the error e(t),
reduced by a factor of $1-\exp[-E(\tau)]$, is directly fed back to
the inputs of the output neurons. As shown in FIG. 6,
the neural network 10 includes a set of input neurons 12,
a set of hidden neurons 14 and a set of output neurons
16. The neurons 12, 14, 16 are selectively interconnected
through weighted synapses (not shown in FIG.
6) whose weights are determined, along with the gains
of the neurons, during a preliminary training exercise.
During this exercise, a training set of time-dependent
neuron inputs is applied during a predetermined time
interval to the inputs of the input neurons 12, which
produces a set of neuron outputs u(t). An error vector
e(t) is determined by a subtractor 18 subtracting the
vector of neuron outputs u(t) from the vector of target
neuron outputs a(t). All elements of the error vector e(t)
are squared and summed and integrated over the predetermined
time interval by the integrator 20 to produce
the total error $E(\tau)$ of Equation (3) at the end of the
current training cycle, which is stored in a register 22.
A multiplier 24 multiplies each component of the error
vector e(t) by the factor $1-\exp[-E(\tau)]$, and the product is
applied as feedback to the input of the corresponding
output neuron 16. A conventional gradient descent
algorithm 26, using the output of the integrator 20 and
the current values of the neuron gains and synaptic
weights of the neural network 10, computes the desired
changes to the gains and weights at the end of the predetermined
time interval, which are then implemented
in the neural network 10. The process is then repeated
in successive cycles with a cyclic period equal to the
predetermined time interval, until the total error $E(\tau)$
reaches zero.

The operation of the system of FIG. 6 is illustrated in
FIG. 7. Preliminarily, the neuron temporal behavior
during the evolutionary learning process is defined by
the differential equation of Equation (1) (block 30 of
FIG. 7) and a training set is defined for the inputs to the
input neurons 12 and for target outputs of the output
neurons 16 (block 32 of FIG. 7). The training set neuron
inputs are time dependent functions over the predetermined
time interval. Next, the training set neuron inputs
are applied to the inputs of the input neurons 12 for
the predetermined time interval $\Delta\tau=[t_0,t_f]$ (block 34 of
FIG. 7) while the errors e(t) between the outputs of the
output neurons 16 and the desired target outputs are
monitored (block 36). The squares of the errors are
summed and integrated over the predetermined time
period (block 38) to produce the total error $E(\tau)$ for the
current learning cycle. The gradient descent algorithm
is then performed (block 40) to compute the changes to
each of the neural network parameters (e.g., neural
gains and synaptic weights), and these changes are then
added to the corresponding neural network parameters
(block 42). If the total error $E(\tau)$ of the current learning
cycle is zero (or below some predetermined threshold),
then the training session is finished (YES branch of
block 44). Otherwise (NO branch of block 44), the system
proceeds to the next learning cycle (block 46) and
the process is repeated starting at block 34 of FIG. 7.
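The cycle of blocks 34 through 46 can be summarized in a short control-flow sketch. The callables `simulate` and `gradient` stand in for the network simulation and for whichever gradient descent algorithm is chosen; they, the name `train`, the initial gain of 1.0 and the update of the forcing gain as 1 − exp(−E) are assumptions of this sketch rather than details fixed by the flow diagram.

```python
import math

def train(simulate, gradient, params, eta=0.1, tol=1e-6, max_cycles=1000):
    """Repeat learning cycles until the total error falls to tol or below.

    simulate(params, lam) -> total error E for one pass over the time interval
    gradient(params, lam) -> list of dE/dp_mu, one entry per parameter
    Returns the final parameters and the list of total errors per cycle.
    """
    lam = 1.0  # assumed: full teacher forcing on the first cycle
    history = []
    for _ in range(max_cycles):
        total_error = simulate(params, lam)            # blocks 34, 36, 38
        grad = gradient(params, lam)                   # block 40
        params = [p - eta * g for p, g in zip(params, grad)]  # block 42
        history.append(total_error)
        if total_error <= tol:                         # block 44, YES branch
            break
        lam = 1.0 - math.exp(-total_error)             # relax forcing, block 46
    return params, history
```

As a usage illustration with a stand-in quadratic error (not a neural network), `train(lambda p, lam: 0.5 * p[0] ** 2, lambda p, lam: [p[0]], [1.0], eta=0.5, tol=0.01)` halves the parameter each cycle and stops after four cycles.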

The preferred embodiment of the invention is illustrated
in FIG. 8. In FIG. 8, the error e(t) is scaled before
being applied as feedback to the neural network 10.
First, the error e(t) is raised to a selected exponential
power $\beta$ by a processor 50, while the target output a(t)
is raised to a complementary exponential power $1-\beta$
by a processor 52. The results are combined by a multiplier
54 and the product is input to the multiplier 24.
The gradient descent algorithm 26 transmits neural
gain adjustments to the neurons 12, 14, 16 and transmits
synaptic weight adjustments to the synapses 53 in order
to adjust the neuron gains and synapse weights at the
end of each time interval. The gradient descent algorithm
computes these adjustments based upon the output
of the integrator 20 in a well-known manner. The
skilled worker may devise various alternative techniques
for scaling e(t) depending upon the specific application
of the invention.

GRADIENT DESCENT ALGORITHMS 

The efficient computation of system response sensi- 
tivities (e.g., error functional gradients) with respect to 
all parameters of a network’s architecture plays a criti- 
cally important role in neural learning. As mentioned 
previously herein, the gradient descent algorithm 26
may be any suitable gradient descent algorithm of the 
prior art. The following describes how one of the best 
gradient descent algorithms is employed in the inven- 
tion. 

Direct Approach Gradient Descent Algorithm 

Let us differentiate the activation dynamics, Eq. (1), including the teacher forcing, Eq. (11), with respect to $p_\mu$. We observe that the time derivative and partial derivative with respect to $p_\mu$ commute. Using the shorthand notation $\partial(\ldots)/\partial p_\mu = (\ldots)_{,\mu}$, we obtain a set of equations referred to as “Forward Sensitivity Equations” (FSEs):


$$\dot{u}_{n,\mu} + \sum_{m} A_{nm}\, u_{m,\mu} = S_{n,\mu}, \qquad t > 0$$
$$u_{n,\mu} = 0, \qquad t = 0 \qquad\qquad (13)$$

in which 

$$A_{nm} = \left(\kappa_n - \gamma_n\, g'_n\, \frac{\partial I_n}{\partial u_n}\right)\delta_{nm} - \gamma_n\, g'_n\, T_{nm} \qquad (14)$$

$$S_{n,\mu} = -u_n\, \delta_{p_\mu,\kappa_n} + \gamma_n\, g'_n \sum_{m} u_m\, \delta_{p_\mu,T_{nm}} + g'_n \Big(\sum_{m} T_{nm} u_m + I_n\Big)\, \delta_{p_\mu,\gamma_n} \qquad (15)$$

In the above expressions, $g'_n$ represents the derivative of $g_n$ with respect to its argument, $\delta$ denotes the Kronecker symbol, and $S_{n,\mu}$ is defined as a nonhomogeneous “source”. The source term contains all explicit derivatives of the neural activation dynamics, Eq. (1), with respect to the system parameters, $p_\mu$. Hence, it is parameter dependent and its size is (N×M). The initial conditions of the activation dynamics, Eq. (1), are excluded from the vector of system parameters p. Thus, the initial conditions of the FSEs will be taken as zero. Their solution will provide the matrix $\partial u/\partial p$ needed for computing the “indirect effect” contribution to the sensitivity of the error functional, as specified by Eq. (9). This gradient descent algorithm is, essentially, similar to the scheme proposed in the above-referenced publication by Williams and Zipser (1989). Computation of the gradients using the forward sensitivity formalism requires solving Eq. (13) M times, since the source term, $S_{n,\mu}$, explicitly depends on $p_\mu$. This system has N equations, each of which requires multiplication and summation over N neurons. Hence, the computational complexity, measured in terms of multiply-accumulates, scales like $N^2$ per system parameter, per time step. Let us assume, furthermore, that the interval $[t_0, t_f]$ is discretized into L time steps. Then, since the number of parameters M is itself of order $N^2$, the total number of multiply-accumulate operations scales like $N^4 L$. Clearly, such a scheme exhibits expensive scaling properties, and would not be very practical for large networks. On the other hand, since the FSEs are solved forward in time, along with the neural dynamics, the method also has inherent advantages. In particular, there is no need for a large amount of memory. Since $u_{n,\mu}$ has $N^3 + 2N^2$ components, the storage requirement scales as $O(N^3)$.
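The forward propagation of sensitivities alongside the neural dynamics may be illustrated by the following sketch. It rests on several assumptions not taken from this section: the activation dynamics are assumed to have the form du_n/dt = -kappa*u_n + tanh(gamma * sum_m T_nm u_m), teacher forcing and external inputs are omitted, only the synaptic weights are treated as parameters, and a first order (Euler) step is used. The function name is hypothetical.

```python
import math


def simulate_with_sensitivities(T, u0, kappa=1.0, gamma=1.0, dt=0.1, steps=63):
    """Integrate the assumed dynamics and the sensitivities s[k][i][j] = d u_k / d T_ij."""
    n = len(u0)
    u = list(u0)
    # Zero initial conditions, as for the FSEs.
    s = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for _ in range(steps):
        g = [math.tanh(gamma * sum(T[k][m] * u[m] for m in range(n)))
             for k in range(n)]
        gp = [1.0 - gk * gk for gk in g]  # derivative of tanh at the same argument
        new_s = [[[0.0] * n for _ in range(n)] for _ in range(n)]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    # Coupling through the weights: the sum over N neurons
                    # that makes the cost N**4 per time step.
                    coupling = sum(T[k][m] * s[m][i][j] for m in range(n))
                    # Explicit derivative of the dynamics with respect to T_ij.
                    source = gamma * gp[k] * u[j] if k == i else 0.0
                    rate = -kappa * s[k][i][j] + gamma * gp[k] * coupling + source
                    new_s[k][i][j] = s[k][i][j] + dt * rate
        u = [u[k] + dt * (-kappa * u[k] + g[k]) for k in range(n)]
        s = new_s
    return u, s


if __name__ == "__main__":
    T = [[0.05, -0.02, 0.01], [0.03, 0.04, -0.06], [-0.01, 0.02, 0.07]]
    u, s = simulate_with_sensitivities(T, [0.01, -0.005, 0.5])
    print(u, s[0][0][2])
```

The three nested loops over the sensitivity array, each containing a sum over the neurons, reflect the $N^4$ cost per time step discussed above, while only the current sensitivity array needs to be stored.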

If the foregoing is employed for the gradient descent algorithm 26 of FIG. 6, then the step of performing a gradient descent algorithm of FIG. 7 (block 40 of FIG. 7) may be broken into steps 40a, 40b and 40c as illustrated in FIG. 9b. Specifically, the first step (block 40a of FIG. 9b) of the gradient descent algorithm 26 is to derive the forward sensitivity equations (Equations 13-15) from the neural learning behavior (Equation 1). The next step is to solve the forward sensitivity equations once for each of the M neural network parameters (block 40b of FIG. 9b). The third step (block 40c of FIG. 9b) is to compute the partial derivative of each neuron output u(t) with respect to each of the M network parameters. Finally, the computation step of block 42' of FIG. 9b employs the integral of this derivative to compute the change to the corresponding network parameter at the end of the current learning cycle.
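The final computation step may be sketched as follows, assuming a discretized learning cycle. The names are hypothetical, the error is taken as target minus output, and the error functional is assumed to be one half of the time integral of the summed squared errors, so that a descent step is the learning rate times the integral of the error multiplied by the sensitivity.

```python
def parameter_change(errors, sensitivities, dt, eta):
    """Change to one parameter: eta * integral over the cycle of sum_n e_n(t) * du_n/dp(t).

    errors[k][n]        -- error of output neuron n at time step k (target minus output)
    sensitivities[k][n] -- derivative of the output of neuron n with respect to the parameter
    """
    integral = 0.0
    for e_t, s_t in zip(errors, sensitivities):
        # Rectangle-rule approximation of the time integral.
        integral += dt * sum(e * s for e, s in zip(e_t, s_t))
    return eta * integral


if __name__ == "__main__":
    errors = [[1.0, 0.0], [0.5, 0.5]]
    sensitivities = [[2.0, 3.0], [1.0, 1.0]]
    print(parameter_change(errors, sensitivities, dt=0.1, eta=0.5))
```

One such change would be computed for each of the M parameters and added to that parameter at the end of the learning cycle.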

NUMERICAL SIMULATIONS

The embodiment of FIG. 8 has been applied to the problem of learning two trajectories: a circle and a figure eight in computer simulations. Results of applying prior art techniques to these problems can be found in the literature, and they offer sufficient complexity for illustrating the computational efficiency of our proposed formalism.

In the following computer simulations, the network that was trained to produce these trajectories using the present invention involved 6 fully connected neurons, with no input, 4 hidden and 2 output units. An additional “bias” neuron was also included. In these simulations, the dynamical systems were integrated using a first order finite difference approximation. The neuron sigmoidal nonlinearity was modeled by a hyperbolic tangent. Throughout, the decay constants $\kappa_n$, the neural gains $\gamma_n$, and $\lambda$ were set to one. Furthermore, $\beta$ was selected to be 7/9. For the learning dynamics, $\Delta\tau$ was set to 6.3 and $\eta$ to 0.015873. The two output units were required to oscillate according to

$$a_5(t) = A \sin \omega t \qquad (16a)$$

$$a_6(t) = A \cos \omega t \qquad (16b)$$

for the circular trajectory, and, according to

$$a_5(t) = A \sin \omega t \qquad (17a)$$

$$a_6(t) = A \sin 2\omega t \qquad (17b)$$

for the figure eight trajectory. Furthermore, we took A = 0.5 and $\omega$ = 1. Initial conditions were defined at $t_0 = 0$. Plotting $a_5$ versus $a_6$ produces the “desired” trajectory. Since the period of the above oscillations is $2\pi$, $t_f = 2\pi$ time units are needed to cover one cycle. We selected $\Delta t = 0.1$, to cover one cycle in approximately 63 time steps.
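For illustration, the target trajectories of Eqs. (16) and (17) and the number of time steps per cycle can be generated as in the following sketch; the function names and the string labels for the two shapes are hypothetical.

```python
import math


def target(t, shape, amplitude=0.5, omega=1.0):
    """Return the pair (a5, a6) of target outputs at time t."""
    a5 = amplitude * math.sin(omega * t)
    if shape == "circle":
        a6 = amplitude * math.cos(omega * t)        # Eq. (16b)
    elif shape == "figure_eight":
        a6 = amplitude * math.sin(2.0 * omega * t)  # Eq. (17b)
    else:
        raise ValueError("shape must be 'circle' or 'figure_eight'")
    return a5, a6


def steps_per_cycle(omega=1.0, dt=0.1):
    """Number of time steps of size dt needed to cover one period 2*pi/omega."""
    return round(2.0 * math.pi / (omega * dt))


if __name__ == "__main__":
    n = steps_per_cycle()
    circle = [target(k * 0.1, "circle") for k in range(n + 1)]
    print(n, circle[0])
```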

Circular Trajectory 

In order to determine the capability and effectiveness of the algorithm, three cases were examined. As initial conditions, the values of $u_n$ were assumed to be uniform random numbers between -0.01 and 0.01 for the simulation studies referred to in the sequel as “Case-1” and “Case-2”. For Case-3, we set $u_n$ equal to zero, except $u_6$, which was set to 0.5. The synaptic interconnections were initialized to uniform random values between -0.1 and +0.1 for all three experiments.

CASE-1

The training was performed over $t_f = 6.5$ time units (i.e., 65 time intervals). A maximum number of 500 iterations was allowed. The results shown in FIG. 10 were obtained by starting the network with the same initial conditions, $u_n(0)$, as used for training, the learned values of the synaptic interconnections, $T_{nm}$, and with no teacher forcing ($\lambda = 0$). As we can see, it takes about 2 cycles until the network reaches a consistent trajectory. Despite the fact that the system’s output was plotted for more than 15 cycles, only the first 2 cycles can be distinguished. FIG. 13 demonstrates that most of the learning occurred during the first 300 iterations.

CASE-2

Here, we decided to increase the length of the trajec- 
tory gradually. A maximum number of 800 learning 
iterations was now allowed. The length of the training 
trajectory was 65 time intervals for the first 100 itera- 
tions, and increased every 100 iterations by 10 time 
intervals. Therefore, it was expected that the error func- 
tional would increase whenever the length of the trajec- 
tory was increased. This was indeed observed, as may 
be seen from the learning graph, shown in FIG. 13. The 
output of the trained network is illustrated in FIG. 11. 
Here again, from 15 recall cycles, only the first two 
(needed to reach the steady orbit) are distinguishable 
and the rest overlap. Training using greater trajectory 
lengths yielded a recall circle much closer to the desired 
one than in the previous case. From FIG. 13, one can 
see that the last 500 iterations did not enhance dramati- 
cally the performance of the network. Thus, for practi- 
cal purposes, one may stop the training after the first 
300 iterations. 
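The schedule used to lengthen the training trajectory may be written compactly as in the following sketch. The function name is hypothetical and iterations are assumed to be counted from zero.

```python
def training_length(iteration, base=65, increment=10, block=100):
    """Length of the training trajectory, in time intervals, at a given learning iteration.

    The length stays at `base` for the first `block` iterations and then grows
    by `increment` every `block` iterations.
    """
    return base + increment * (iteration // block)


if __name__ == "__main__":
    for it in (0, 99, 100, 799):
        print(it, training_length(it))
```

Setting `increment=5` gives the corresponding schedule for a trajectory lengthened by 5 time intervals every 100 iterations.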

CASE-3

The selection of appropriate initial conditions for $u_n$ plays an important role in the effectiveness of the learning. Here, all initial values of $u_n$ were selected to be exactly zero except the last unit, where $u_6 = 0.5$ was chosen. This corresponds to an initial point on the circle. The length of the trajectory was increased successively, as in the previous case. In spite of the fact that we allowed the system to perform up to 800 iterations, the learning was essentially completed in about 200 iterations, as shown in FIG. 13. The results of the network’s recall are presented in FIG. 12, which shows an excellent match.

Figure Eight Trajectory

For this problem, the synaptic interconnections were initialized to uniform random values between -1 and +1. As initial conditions, the values of $u_n$ were assumed to be uniform random numbers between -0.01 and 0.01. The following three situations were examined.

CASE-4

The training was performed over $t_f = 6.5$ time units (i.e., 65 time intervals). A maximum number of 1000 iterations was allowed. The results shown in FIG. 14 were obtained by starting the network with the same initial conditions, $u_n(0)$, as used for training, the learned values of the synaptic interconnections, $T_{nm}$, and with no teacher forcing ($\lambda = 0$). As we can see, it takes about 3 cycles until the network reaches a consistent trajectory. Despite the fact that the system’s output was plotted for more than 15 cycles, only the first 3 cycles can be distinguished.

CASE-5

Here, we again decided to increase the length of the trajectory gradually. A maximum number of 1000 iterations was now allowed. The length of the training trajectory was 65 time intervals for the first 100 iterations, and was increased every 100 iterations by 5 time intervals. Therefore, it was again expected that the objective functional would increase whenever the length of the trajectory was increased. This was indeed observed, as may be seen from the learning graph, shown in FIG. 17. The output of the trained network is illustrated in FIG. 15. Here again, from 15 recall cycles, only the first three (needed to reach the steady orbit) are distinguishable, and the rest overlap. As a direct result of training using greater trajectory lengths, orbits much closer to the desired one than in the previous case were obtained.

CASE-6

The learning in this case was performed under conditions similar to CASE-5, but with the distinction that $\lambda$ was modulated according to Eq. (12). The results of the network’s recall are presented in FIG. 16, and demonstrate a dramatic improvement with respect to the previous two cases.
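Eq. (12) is not reproduced in this portion of the description. Purely as an illustration, and assuming the modulation takes the form of one minus a decaying exponential of the error functional of the preceding learning cycle, the amplitude of the teacher forcing could be computed as in the following sketch. The function name and the exact functional form are assumptions.

```python
import math


def forcing_amplitude(previous_cycle_error: float) -> float:
    """Assumed modulation: lambda = 1 - exp(-E), with E the error functional of the previous cycle.

    The amplitude tends to zero as the error vanishes, so the teacher forcing
    fades out as the network parameters converge.
    """
    return 1.0 - math.exp(-previous_cycle_error)


if __name__ == "__main__":
    for err in (2.0, 0.5, 0.01, 0.0):
        print(err, forcing_amplitude(err))
```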

It is important to keep in mind the following observations with regard to the foregoing simulation results:

1) For the circular trajectory, $\lambda$ was kept constant throughout the simulations and not modulated according to Eq. (12). As we can see from FIG. 13, in cases 1 and 2 the error functional was not reduced to zero. Hence, a discrepancy in the functional form of the neural activation dynamics used during the learning and recall stages occurred. This was a probable cause for the poor performance of the network. In case 3, however, the error functional was reduced to zero. Therefore, the teacher forcing effect vanished by the end of the learning.

2) For the figure eight trajectory, the difference between cases 5 and 6 lies in the modulation of $\lambda$ (i.e., the amplitude of the teacher forcing). Even though in both cases the error functional was reduced to a negligible level, the effect of the teacher forcing in case 5 was not completely eliminated over the entire length of the trajectory. This points toward the fact that modulation of $\lambda$ not only reduces the number of iterations but also provides higher quality results.

In order to assess the effectiveness of the new
method, the foregoing simulations applied it to two 
examples of representative complexity which have re- 
cently been analyzed in the open literature. We have 
demonstrated that a circular trajectory can be learned 
in approximately 200 iterations compared to the 12000 
reported by Pearlmutter (1989). A figure eight trajec- 
tory was achieved in under 500 iterations compared to 
20000 previously required. Most important, however, is 
the quality of the obtained results. The trajectories com- 
puted using our new method are much closer to the 
target trajectories than was reported in previous studies. 

While the invention has been described in accordance with the preferred embodiment in which the feedback is reduced as a function of the error E(r) over successive learning cycles, it may be that in some instances such a decrease will not be steady and may not even occur in individual cycles. Moreover, other schemes to modulate the feedback in accordance with the invention may be employed. For example, in those cases where a steady decrease in the error E(r) over successive cycles may be generally expected, the feedback could be modulated as a function of the number of cycles independently or dependently of the error E(r). Moreover, while the invention has been described with reference to a gradient descent algorithm used to adjust both the synapse weights and the neuron gains, any subset or all of these neural network parameters or other neural network parameters may be adjusted. For example, it may be that only the synapse weights would be adjusted at the end of each repetitive cycle.

While the invention has been described in connection with training a neural network with time-varying training vectors and target vectors, the invention may also be applied in training neural networks with time-invariant training vectors and target vectors.

While the invention has been described in detail by 
specific reference to preferred embodiments of the in- 
vention, it is understood that variations and modifica- 
tions thereof may be made without departing from the 
true spirit and scope of the invention. 

What is claimed is: 

1. Apparatus for training a neural network compris- 
ing input, hidden and output sets of neurons having 
respective neuron gains interconnected by respective 
synapses having respective synapse weights to produce 
at outputs of said output set of neurons a time-varying 
target vector in response to a time-varying training 
vector applied to inputs of said input set of neurons, said 
time-varying training and target vectors being defined 
for a predetermined time interval, said apparatus com- 
prising: 

means for applying respective elements of said time- 
varying training vector to the inputs of respective 
ones of said input set of neurons during said prede- 
termined time interval; 

means for measuring an error vector constructed from the differences between the output values produced at the outputs of said output set of neurons and corresponding elements of said time-varying target vector during said predetermined time interval;

means for determining a function of each individual 
element of said error vector during said predeter- 
mined time interval; 

means for feeding back said function of each individual element of said error vector to inputs of respective ones of said set of output neurons during said predetermined time interval;

means responsive to said error vector and to current 
values of said neuron gains and synapse weights for 
changing at least one of (a) said neuron gains and 
(b) said synapse weights in accordance with a gra- 
dient descent algorithm at the end of said predeter- 
mined time interval to decrease the magnitude of 
the said error vector; and wherein, 

said means for applying, said means for measuring, 
said means for determining, said means for feeding 
back and said means for changing all operate to- 
gether in repetitive cycles, each of said cycles hav- 
ing a time duration equal to said predetermined 
time interval; and, 

said means for determining a function of each individ- 
ual element of said error vector comprises means 
for modulating said function in accordance with 
measurements of said error vector by said means 
for measuring during a previous one of said repeti- 
tive cycles. 

2. The apparatus of claim 1 wherein said means for 
modulating said function comprises means for multiply- 
ing said function by a factor that depends upon the 
measurements of all elements of said error vector mea- 
sured by said means for measuring during an immedi- 
ately preceding one of said repetitive cycles. 

3. The apparatus of claim 2 wherein said factor is 1- 
exp, wherein E(r) is an integral over said predetermined 
time interval of a sum of squares of all elements of the 
error vector measured during said immediately preced- 
ing one of said repetitive cycles. 

4. The apparatus of claim 1 wherein said means for 
determining a function of each element of said error 
vector comprises means for scaling said function. 

5. The apparatus of claim 4 wherein said means for scaling comprise means for raising each element of said error vector to an exponential power of $\beta$, raising the corresponding element of said time-varying target vector to an exponential power of $1-\beta$, and multiplying them together, wherein $\beta$ is a rational number less than one.

6. The apparatus of claim 5 wherein $\beta$ is on the order of approximately 7/9.

7. The apparatus of claim 1 wherein the outputs of 
each of said neurons obey a set of differential equations 
during said predetermined time interval, said differen- 
tial equation being a function of said neuron gains, said 
synapse weights and said time-varying training and 
target vectors, and wherein said means for computing 
said changes by performing a gradient descent algo- 
rithm comprise: 

means for deriving a set of sensitivity equations from 
said set of differential equations; 

means for solving said set of sensitivity equations 
once for each one of a set of parameters of said 
neural network, said parameters comprising at least 
one of (a) said neuron gains and (b) said synapse 
weights; 

means for computing a differential of the output value 
of each neuron with respect to corresponding ones 
of said parameters; and 

means for computing a change to be made to each 
one of said parameters at the end of said predeter- 
mined time interval by integrating over said prede- 
termined time interval the product of said differen- 
tial and a corresponding element of said error vec- 
tor. 

8. The apparatus of claim 1 further comprising:

means for progressively decreasing said function in successive ones of said repetitive cycles.

9. The apparatus of claim 8 wherein said means for progressively decreasing said function decreases said function as a function of a decrease in a functional of said error vector over said successive ones of said repetitive cycles.

10. The apparatus of claim 9 wherein said functional comprises an integral of a function of each element of said error vector measured during a previous one of said repetitive cycles.

11. A method for training a neural network comprising input, hidden and output sets of neurons having respective neuron gains interconnected by respective synapses having respective synapse weights to produce at outputs of said output set of neurons a time-varying target vector in response to a time-varying training vector applied to inputs of said input set of neurons, said time-varying training and target vectors being defined for a predetermined time interval, said method comprising:

applying respective elements of said time-varying training vector to the inputs of respective ones of said input set of neurons during said predetermined time interval;

measuring an error vector constructed from the differences between the output values produced at the outputs of said output set of neurons and corresponding elements of said time-varying target vector during said predetermined time interval;

determining a function of each individual element of said error vector during said predetermined time interval;

feeding back said function of each individual element of said error vector to inputs of respective ones of said set of output neurons during said predetermined time interval;

changing at least one of (a) said neuron gains and (b) said synapse weights in response to said error vector and to current values of said neuron gains and synapse weights in accordance with a gradient descent algorithm at the end of said predetermined time interval to decrease the magnitude of the said error vector; and wherein,

said applying, said measuring, said determining, said feeding back and said changing is performed in repetitive cycles, each of said cycles having a time duration equal to said predetermined time interval; and,

said determining a function of each individual element of said error vector comprises modulating said function in accordance with measurements of said error vector by said measuring during a previous one of said repetitive cycles.

12. The method of claim 11 wherein said modulating said function comprises multiplying said function by a factor that depends upon the measurements of all elements of said error vector measured during an immediately preceding one of said repetitive cycles.

13. The method of claim 12 wherein said factor is 1-exp, wherein E(r) is an integral over said predetermined time interval of a sum of squares of all elements of the error vector measured during said immediately preceding one of said repetitive cycles.

14. The method of claim 11 wherein said determining 
a function of each element of said error vector com- 
prises scaling said function. 

15. The method of claim 14 wherein said scaling comprise raising each element of said error vector to an exponential power of $\beta$, raising the corresponding element of said time-varying target vector to an exponential power of $1-\beta$, and multiplying them together, wherein $\beta$ is a rational number less than one.

16. The method of claim 15 wherein $\beta$ is on the order of approximately 7/9.

17. The method of claim 11 wherein the outputs of each of said neurons obey a set of differential equations during said predetermined time interval, said differential equations depending upon said neuron gains, said synapse weights and said time-varying training and target vectors, and wherein said computing said changes comprises:

deriving a set of sensitivity equations from said set of differential equations;

solving said set of sensitivity equations once for each one of a set of parameters of said neural network, said parameters comprising at least one of (a) said neuron gains and (b) said synapse weights;

computing a differential of the output value of each neuron with respect to corresponding ones of said parameters; and

computing a change to be made to each one of said parameters at the end of said predetermined time interval by integrating over said predetermined time interval the product of said differential and a corresponding element of said error vector.

18. The method of claim 11 further comprising:

progressively decreasing said function in successive ones of said repetitive cycles.

19. The method of claim 18 wherein said progressively decreasing said function comprises decreasing said function in accordance with a decrease in a functional of said error vector over said successive ones of said repetitive cycles.

20. A method of training a neural network to output a target vector in response to a training vector, said method comprising:

feeding back to neuron inputs of said neural network a function of an error vector corresponding to a difference between said target vector and a current output vector of said neural network; and,

stimulating said neural network with said training vector while feeding back said function of the error vector;

modulating said function in accordance with a factor dependent upon elements of said error vector;

said feeding back is performed in repetitive cycles of a cyclic period and cyclically adjusting parameters of said neural network, and wherein said adjusting is performed in accordance with a measurement of said error vector during a previous one of said repetitive cycles.

21. Apparatus for training a neural network comprising input, hidden and output sets of neurons having respective neuron gains interconnected by respective synapses having respective synapse weights to produce at outputs of said output set of neurons a target vector in response to a training vector applied to inputs of said input set of neurons, said apparatus comprising:

means for applying respective elements of said training vector to the inputs of respective ones of said input set of neurons;

means for measuring an error vector constructed from the differences between the output values produced at the outputs of said output set of neurons and corresponding elements of said target vector;

means for determining a function of each individual 
element of said error vector; 

means for feeding back said function of each individ- 
ual element of said error vector to inputs of respec- 
tive ones of said set of output neurons; 

means responsive to said error vector and to current 
values of said neuron gains and synapse weights for 
changing at least one of (a) said neuron gains and 
(b) said synapse weights in accordance with a gra- 
dient descent algorithm to decrease the magnitude 
of the said error vector; and wherein, 

said means for applying, said means for measuring, 
said means for determining, said means for feeding 
back and said means for changing all operate to- 
gether in repetitive cycles, each of said cycles hav- 
ing a time duration equal to a predetermined time 
interval; and, 

said means for determining a function of each individ- 
ual element of said error vector comprises means 
for modulating said function in accordance with 
measurements of said error vector by said means 
for measuring during a previous one of said repeti- 
tive cycles. 

22. The apparatus of claim 21 wherein said means for 
modulating said function comprises means for multiply- 
ing said function by a factor that depends upon the 
measurements of all elements of said error vector mea- 
sured by said means for measuring during an immedi- 
ately preceding one of said repetitive cycles. 

23. The apparatus of claim 22 wherein said factor is 
1-exp, wherein E(r) is an integral over said predeter- 
mined time interval of a sum of squares of all elements 
of the error vector measured during said immediately 
preceding one of said repetitive cycles. 

24. The apparatus of claim 21 wherein said means for 
determining a function of each element of said error 
vector comprises means for scaling said function. 

25. The apparatus of claim 24 wherein said means for scaling comprise means for raising each element of said error vector to an exponential power of $\beta$, raising the corresponding element of said target vector to an exponential power of $1-\beta$, and multiplying them together, wherein $\beta$ is a rational number less than one.

26. The apparatus of claim 25 wherein $\beta$ is on the order of approximately 7/9.

27. The apparatus of claim 21, wherein the outputs of each of said neurons obey a set of differential equations during said predetermined time interval, said differential equation being a function of said neuron gains, said synapse weights and said training and target vectors, and wherein said means for computing said changes by performing a gradient descent algorithm comprise:

means for deriving a set of sensitivity equations from said set of differential equations;

means for solving said set of sensitivity equations once for each one of a set of parameters of said neural network, said parameters comprising at least one of (a) said neuron gains and (b) said synapse weights;

means for computing a differential of the output value of each neuron with respect to corresponding ones of said parameters; and

means for computing a change to be made to each one of said parameters at the end of said predetermined time interval by integrating over said predetermined time interval the product of said differential and a corresponding element of said error vector.

28. The apparatus of claim 21 further comprising:

means for progressively decreasing said function in successive ones of said repetitive cycles.

29. The apparatus of claim 28 wherein said means for 
progressively decreasing said function decreases said 
function as a function of a decrease in a functional of 
said error vector over said successive ones of said repet- 
itive cycles. 

30. The apparatus of claim 29 wherein said functional 
comprises an integral of a function of each element of 
said error vector measured during a previous one of said 
repetitive cycles. 

31. A method for training a neural network compris- 
ing input, hidden and output sets of neurons having 
respective neuron gains interconnected by respective 
synapses having respective synapse weights to produce 
at outputs of said output set of neurons a target vector 
in response to a training vector applied to inputs of said 
input set of neurons, said method comprising: 

applying respective elements of said training vector 
to the inputs of respective ones of said input set of 
neurons; 

measuring an error vector constructed from the dif- 
ferences between the output values produced at the 
outputs of said output set of neurons and corre- 
sponding elements of said target vector; 

determining a function of each individual element of 
said error vector; 

feeding back said function of each individual element 
of said error vector to inputs of respective ones of 
said set of output neurons; 

changing at least one of (a) said neuron gains and (b) 
said synapse weights in response to said error vec- 
tor and to current values of said neuron gains and 
synapse weights in accordance with a gradient 
descent algorithm to decrease the magnitude of the 
said error vector; and wherein, 

said applying, said measuring, said determining, said 
feeding back and said changing is performed in 
repetitive cycles, each of said cycles having a time 
duration equal to a predetermined time interval; 
and, 

said determining a function of each individual ele- 
ment of said error vector comprises modulating 
said function in accordance with measurements of 
said error vector by said measuring during a previ- 
ous one of said repetitive cycles. 

32. The method of claim 31 wherein said modulating said function comprises multiplying said function by a factor that depends upon the measurements of all elements of said error vector measured during an immediately preceding one of said repetitive cycles.

33. The method of claim 32 wherein said factor is 1-exp, wherein E(r) is an integral over said predetermined time interval of a sum of squares of all elements of the error vector measured during said immediately preceding one of said repetitive cycles.

34. The method of claim 31 wherein said determining 
a function of each element of said error vector com- 
prises scaling said function. 

35. The method of claim 34 wherein said scaling comprise raising each element of said error vector to an exponential power of $\beta$, raising the corresponding element of said target vector to an exponential power of $1-\beta$, and multiplying them together, wherein $\beta$ is a rational number less than one.

36. The method of claim 35 wherein $\beta$ is on the order of approximately 7/9.

37. The method of claim 31 wherein the outputs of each of said neurons obey a set of differential equations during said predetermined time interval, said differential equations depending upon said neuron gains, said synapse weights and said training and target vectors, and wherein said computing said changes comprises:

deriving a set of sensitivity equations from said set of differential equations;

solving said set of sensitivity equations once for each one of a set of parameters of said neural network, said parameters comprising at least one of (a) said neuron gains and (b) said synapse weights;

computing a differential of the output value of each neuron with respect to corresponding ones of said parameters; and

computing a change to be made to each one of said parameters at the end of said predetermined time interval by integrating over said predetermined time interval the product of said differential and a corresponding element of said error vector.

38. The method of claim 31 further comprising:

progressively decreasing said function in successive ones of said repetitive cycles.

39. The method of claim 38 wherein said progressively decreasing said function comprises decreasing said function in accordance with a decrease in a functional of said error vector over said successive ones of said repetitive cycles.